InfoMagic Internet Tools 1995 April

Text File | 1995-04-24 | 3.1 KB | 82 lines

>Darn good question. Your approach appears to have the correct
>results, but I'm not sure it's practical for many implementations
>(global search-and-replace operations are inconvenient for
>sequential processing models), and it certainly isn't a healthy
>way to think about SGML documents.

But most browsers seem to have caching anyway, which means they can do
global search/replace. But you can still do it more or less sequentially.
Just buffer strings of new-lines until you know what follows them, and
then deal with it. There's no method you can propose which is correct
and doesn't involve storing something somewhere.

>The way to think about SGML documents, IMHO, is this: the sequence
>of characters comprising an SGML document are presented to an
>SGML parser, which parses the markup from the data and passes
>the "results" to the processing application.

This is another alternative I considered. But I figured that I have to
deal with various parsing things when I read the HTML anyway. I was just
going to take each chunk of data (with anchors pre-processed out) and
remove all whitespace at the beginning and end (except for PRE sections
and such). But if someone put in whitespace, why should I muck with it?
Who knows, they might have even wanted it there.

>>1. For each tag NOT in
>>   <PRE> </PRE> <A> </A> <PLAINTEXT>
>>   remove ALL surrounding new-lines.
>
>First, let's get one thing straight: the PLAINTEXT element as
>described by the original HTML documentation is not representable
>in SGML. For my purposes, I consider the HTML document to
>end at the <PLAINTEXT> tag, and I consider the rest of the
>data stream to be an RFC-822 message body or a MIME text/plain body,
>and not SGML at all.

I hadn't meant otherwise. But you have to read it in anyway, and since
my method deals with things prior to any other parsing, you treat it all
as one clump.
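
For illustration only, here is a rough Python sketch of the "buffer the
new-lines until you know what follows them" idea applied to rule 1. It
is not code from either poster. It assumes the input has already been
split into hypothetical ("text", str) and ("tag", name) pairs by some
earlier step, and the names EXEMPT and strip_newlines_around_tags are
made up for this example.

```python
# Tags whose surrounding new-lines are left alone (rule 1 in the post).
EXEMPT = {"PRE", "/PRE", "A", "/A", "PLAINTEXT"}

def strip_newlines_around_tags(events):
    """Yield events with new-lines removed around non-exempt tags.

    `events` is an iterable of ("text", str) or ("tag", name) pairs;
    this event shape is an assumption made for the sketch.
    """
    pending = ""           # trailing new-lines held until we see what follows
    strip_leading = False  # True right after a non-exempt tag
    for kind, value in events:
        if kind == "tag":
            exempt = value.upper() in EXEMPT
            if exempt and pending:
                yield ("text", pending)   # keep new-lines before exempt tags
            pending = ""                  # otherwise the held new-lines are dropped
            strip_leading = not exempt
            yield (kind, value)
            continue
        value = pending + value
        pending = ""
        if strip_leading:
            value = value.lstrip("\n")
            if not value:
                continue                  # still directly after the tag
            strip_leading = False
        body = value.rstrip("\n")
        pending = value[len(body):]       # hold trailing new-lines back
        if body:
            yield ("text", body)
    if pending:
        yield ("text", pending)           # nothing follows, so keep them

events = [
    ("text", "intro\n\n"), ("tag", "H1"), ("text", "\nTitle\n"),
    ("tag", "/H1"), ("text", "\nbody "), ("tag", "A"),
    ("text", "link"), ("tag", "/A"), ("text", "\n"),
]
result = list(strip_newlines_around_tags(events))
```

The only state carried between events is the held run of new-lines and
one flag, which is the "storing something somewhere" that the reply says
cannot be avoided.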
>Next, let's keep in mind that you can't do things like the following
>global substitution,
>s/\n+(<(H1|H2|ADDRESS...))>/$2/g;
>because it might find things that look like tags but aren't,
>for example
>
><foo bar="
><H1>this is a little cooky, but nonetheless legal and possible.">
>
>But even if you're using a proper SGML parser, consider:
>
><H1>Here we go!
><a href="#xyz">click here</a>
>There we went!
></H1>
>
>The parser will return an H1 start tag, and then the
>string "Here we go!\n". At this point, your rule doesn't
>tell me what to do with the newline. I have to get
>the next object before I decide.

Like I said before, you have to do some sort of storage at some point
anyway.

>Hmm... I guess that's reasonable. But I'd rather just pass all the

Like I said before, you have to do some sort of storage at some point
anyway.

>My point is: don't use whitespace to represent significant
>information except in the PRE element. Use the tags that
>are defined to have significance.

I suppose I agree with this more or less, at least from the point of
view of generating my own code. But we have to make something clear - can
a browser keep all the whitespace if it wants to? Or in other words, can
an HTML generator assume collapsing whitespace, or just be aware that it
might happen?

tom
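
As an aside, the quoted objection to the global substitution is easy to
reproduce. The Python sketch below is only an illustration: the pattern
is a loose adaptation of the quoted one, the tag list is cut down to H1,
H2 and ADDRESS because the original elides the rest, and naive_strip is
a made-up name.

```python
import re

# Loose adaptation of the quoted substitution: drop new-lines that
# directly precede something that looks like one of these start tags.
NAIVE = re.compile(r"\n+(<(?:H1|H2|ADDRESS)>)")

def naive_strip(doc):
    """Apply the substitution globally, with no knowledge of markup context."""
    return NAIVE.sub(r"\1", doc)

# Intended case: the new-line before a real H1 start tag is removed.
plain = "some text\n<H1>Title</H1>"
plain_out = naive_strip(plain)

# Problem case from the post: "<H1>" here is attribute data, not a tag,
# yet the new-line inside the attribute value is removed as well.
tricky = '<foo bar="\n<H1>this looks like a tag but is attribute data">'
tricky_out = naive_strip(tricky)
```

In the second case the substitution has changed the attribute value,
which is why the quoted text argues for letting a parser separate markup
from data before any new-lines are touched.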